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Abstract. Chandrasekaran, Parrilo and Willsky (2010) proposed a convex optimization problem to characterize graphical 
model selection in the presence of unobserved variables. This convex optimization problem aims to estimate an inverse covariancc 
matrix that can be decomposed into a sparse matrix minus a low-rank matrix from sample data. Solving this convex optimization 
problem is very challenging, especially for large problems. In this paper, we propose two alternating direction methods for solving 
this problem. The first method is to apply the classical alternating direction method of multipliers to solve the problem as a 
consensus problem. The second method is a proximal gradient based alternating direction method of multipliers. Our methods 
exploit and take advantage of the special structure of the problem and thus can solve large problems very efficiently. Global 
convergence result is established for the proposed methods. Numerical results on both synthetic data and gene expression data 
show that our methods usually solve problems with one million variables in one to two minutes, and are usually five to thirty 
five times faster than a state-of-the-art Newton-CG proximal point algorithm. 
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1. Introduction. In this paper, wc consider alternating direction methods with the theoretical guar- 
antee of global convergence for computing the latent- variable graphical model selection [7] . Graphical model 
selection is closely related to the inverse covariance matrix estimation problem, which is of fundamen- 
tal importance in multivariate statistical inference. In particular, when data X = (X\,--- ,X p )' follow 
a p-dimensional joint normal distribution with some unknown variance matrix S, the precision matrix 
= can be directly translated into a Gaussian graphical model. The zero entries in the precision 
matrix = (@ij) 1<i ■< precisely capture the desired conditional independencies in the Gaussian graphical 
model [SllQjj], i.e. 6ij = if and only if Xi X Xj\ X_uj\. The Gaussian graphical model has been suc- 
cessfully used to explore complex systems consisting of Gaussian random variables in many research fields, 
including gene expression genomics 22. 55], image processing [33], macroeconomics determinants study [13] , 
and social study [TJ [3D] ■ 

Nowadays, massive high-dimensional data are being routinely generated with rapid advances of modern 
high-throughput technology (e.g. microarray and functional magnetic resonance imaging). Estimation of a 
sparse graphical model has become increasingly important in the high-dimensional regime, and some well- 
developed penalization techniques have received considerable attention in the statistical literature. [40] was 
the first to study the high-dimensional sparse graphical model selection problem, and they proposed the 
neighborhood penalized regression scheme which performs the lasso [51j to fit each neighborhood regression 
and summarizes the sparsity pattern by aggregation via union or intersection. |44] proposed the joint 
sparse regression model to jointly estimate all neighborhood lasso penalized regressions. [58] considered the 
Dantzig selector [6] as an alternative to the lasso in each neighborhood regression. [5] proposed a constrained 
l\ minimization estimator called CLIME for estimating sparse precision matrices, and established rates of 
convergence under both the entrywise norm and the Frobcnius norm. Computationally CLIME can be 
further decomposed into a series of vector minimization problems. 
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The £i-penalized maximum normal likelihood method is another popular method for graphical model 
selection [59l [2j [21] [47] . [47] established its rate of convergence under the Frobenius norm. Under the 
irrcprcsentable conditions, |46j obtained the convergence rates under the entrywise ^ 00 norm and the spectral 
norm. Define the entrywise l\ norm of S as the sum of absolute values of the entries of S, i.e., \\S\\i := 
For a given sample covariance matrix E £ W xp , the £i-penalized maximum normal likelihood 
estimation can be formulated as the following convex optimization problem. 

(1.1) mm (S,t) -logdct5 + p||5||i, 

where the first part (S, S) — log det S gives the normal log-likelihood function of S 1 , and the entrywise i\ norm 
115111 is used to promote the sparsity of the resulting matrix. Note that in the literature the £i-penalized 
maximum normal likelihood usually uses the so-called 1-off absolute penalty || -S 1 1| i,off '■= ^2i^j However, 
||5||i and ||5||i, ff cause no difference when using our algorithm. [7J have used \\S\\i in defining their convex 
optimization problem and hence we follow their convention in the current paper. 

The aforementioned Gaussian graphical model selection methods were proposed under the ideal setting 
without missing variables. The recent paper by [7J considered a more realistic scenario where the full data 
consist of both observed variables and missing (hidden) variables. Let X px i be the observed variables. 
Suppose that there are some hidden variables V rx i (r <C p) such that [X, Y) jointly follow a multivariate 
normal distribution. Denote the covariance matrix by £(x,y) an d the precision matrix by 0(x,y)- Then we 
can write S(x,y) = [Ex, ^xr; £yx, £y] and Qtx,Y) = [©Xj ©xr; ®yx, ®y}- Given the hidden variables 
Y, the conditional concentration matrix of observed variables, 6j, is sparse for a sparse graphical model. 
However, the marginal concentration matrix of observed variables, S^- 1 = 0x — @XY©y ©yx 5 might not 
be a sparse matrix but a difference between the sparse term 0x and the low-rank term ©x:K©y ©FX- 
The problem of interest is to recover the sparse conditional matrix Ox based on observed variables X. [7J 
accomplished this goal by solving a convex optimization problem under the assumption that S^- 1 = S — L 
for some sparse matrix S and low-rank matrix L. The low rank assumption on L holds naturally since r 
is much less than p. Motivated by the success of the convex relaxation for rank-minimization problem, [7J 
introduced a regularized maximum normal likelihood decomposition framework called the latent variable 
graphical model selection (LVGLASSO) as follows. 

(1.2) min (5'-L,Sx)-logdct(5'-L) + a||S , ||i+/3Tr(L), s.t. S - LyQ,LhO, 

where Ex is the sample covariance matrix of X and Tr(L) denotes the trace of matrix L. In the high- 
dimensional setting, [7] established the consistency theory for (11.21) concerning its recovery of the support 
and sign pattern of S and the rank of L. 

Solving the convex optimization problem (|1.2[) is very challenging, especially for large problems. [7J 
considered (|1.2j) as a log-determinant semidefinite programming (SDP) problem, and used a Newton-CG 
based proximal point algorithm (LogdetPPA) proposed by [52] to solve it. However, LogdetPPA does not 
take advantage of the special structure of the problem, and we argue that it is inefficient for solving large- 
scale problems. To illustrate our point, let us consider the special case of (|1.2[) with L = 0, and then the 
latent variable graphical model selection (|1.2|) exactly reduces to the Gaussian graphical model selection 
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(jl.ip . Note that (jl.ip can be rewritten as 

min max — log det S + (Ex + W, X) , 

S ||W||oo<p 

where || W||oo is the largest absolute value of the entries of U. The dual problem of Ql.ip can be obtained by 
exchanging the order of max and min, i.e., 

max min — log det X + (Ex + W, S) , 

||W||oo<P s 

which is equivalent to 

(1.3) max {logdetVF+p: || W - ± x |U < p}- 

Both the primal and the dual graphical Lasso problems (jl.ip and (| 1 .3[) can be viewed as semidcfinite pro- 
gramming problems and can be solved via interior point methods (IPMs) in polynomial time j4j. However, 
the per-iteration computational cost and memory requirements of an IPM are prohibitively high for (jl.ip 
and (jl.3p . especially when the size of the matrix is large. Customized SDP based methods such as the ones 
studied in [52] and [32] require a reformulation of the problem that increases the size of the problem and 
thus makes them impractical for solving large-scale problems. Therefore, most of the methods developed for 
solving (jl.ip and (|1.3p are first-order methods. These methods include block coordinate descent type meth- 
ods [21 [2U 2H1 [S5], projected gradient method [T5] and variants of Nesterov's accelerated method [TTJ [55] , 
Recently, alternating direction methods have been applied to solve (jl.ip and shown to be very effective 

[Sol S3- 

In this paper, we propose two alternating direction type methods to solve the latent variable graphical 
model selection. The first method is to apply the alternating direction method of multipliers to solve this 
problem. This is due to the fact that the latent variable graphical model selection can be seen as a special 
case of the consensus problem discussed in [3]. The second method we propose is an alternating direction 
method with proximal gradient steps. To apply the second method, we first group the variables into two 
blocks and then apply the alternating direction method with one of the subproblems being solved inexactly 
by taking a proximal gradient step. Our methods exploit and take advantage of the special structure of 
the problem and thus can solve large problems very efficiently. Although the convergence results of the 
proposed methods are not very different from the existing results for alternating direction type methods, we 
still include the convergence proof for the second method in the appendix for completeness. We apply the 
proposed methods to solving problems from both synthetic data and gene expression data and show that 
our method outperform the state-of-the-art Newton-CG proximal point algorithm LogdetPPA significantly 
on both accuracy and CPU times. 

The rest of this paper is organized as follows. In Section [21 we give some preliminaries on alternating 
direction method of multipliers and proximal mappings. In Section G2 we propose solving LVGLASSO (jl.2p 
as a consensus problem using the classical alternating direction method of multipliers. We propose the 
proximal gradient based alternating direction method for solving (|1.2p in Section 0] In Section [SJ we apply 
our alternating direction method to solving (|1.2p using both synthetic data and gene expression data. We 
draw some conclusions in Section [6] 
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2. Preliminaries. Problem (|1.2I) can be rewritten in the following equivalent form by introducing a 
new variable R: 



min (i? ) Sx)-logdeti? + a||5|| 1 +^Tr(L) 
s.t. R = S- L,R >- 0,L h 0, 

which can be further reduced to 

(2.2) mm(R,t x )-logdetR + a\\S\\ 1 +l3Tr(L)+I(Ly0), s.t. R - S + L = 0, 

where the indicator function 1(L y 0) is defined as 



(2.3) X(L t 0) 



0, if L h 
+oo, otherwise. 



Note that we have dropped the constraint R >- since it is already implicitly imposed by the log dct R 
function. 

Now since the objective function involves three separable convex functions and the constraint is simply 
linear, Problem (|2.2p is suitable for alternating direction method of multipliers (ADMM). ADMM is closely 
related to the Douglas-Rachford and Peaceman-Rachford operator-splitting methods for finding zero of the 
sum of two monotone operators that have been studied extensively in [141 l43l l34l [TBI [T8l OH [10] . ADMM 
has been revisited recently due to its success in the emerging applications of structured convex optimization 
problems arising from image processing, compressed sensing, machine learning, semidefinite programming 
and statistics etc. (see e.g., (2H [HI [531 EZl 123 US [501 HH1 123 IH1 ISH O 

Problem (|2.2[) is suitable for alternating direction methods because the three convex functions involved 
in the objective function, i.e., 

(2.4) f(R):=(R,t x )-logdetR, 



(2.5) g(S) := a\\S\U, 
and 

(2.6) h{L) := (3Tr(L) + 1(L h 0), 

have easy proximal mappings. Note that the proximal mapping of function c : M. mxn — > R mx ™ for given 
£ > and Z £ R mx ™ is defined as 

(2.7) prox(c, f , Z) := argmin XeRmx „ \\X - Z\\ 2 F + c{X). 
The proximal mapping of f(R) defined in (|2.4p is 

(2.8) prox(/,^, Z) := argmin fl 1-\\R - Z\\% + (R, t x ) - logdeti?. 



The first-order optimality conditions of (|2.8p are given by 

(2.9) R + itx-Z-^R- 1 = 0. 
It is easy to verify that 

(2.10) R := ?7diag(7)[/ T , 

satisfies (|2.9[) and thus gives the optimal solution of (|2.8|) . where {/ diag(<r)[/ T is the eigenvalue decomposition 
of matrix ££a" — 2 and 



(2.11) 7< = I — crj + y cr? + 4£j /2, Vi = 1, . . . ,p. 

Note that (|2.1ip guarantees that the solution of (|2.8[) given by (|2.10p is a positive definite matrix. The 
proximal mapping of g(S) defined in (|2.5p is 

(2.12) prox( 5 ,C,Z) :=argmin s - Z\\% + a\\S\\i. 

It is well known that (|2.12[) has a closed-form solution that is given by the i\ shrinkage operation 

S k+1 := Shrink(Z,0, 

where Shrink(-, •) is defined as 

{Zij - T, if Zij > T 
Zij + t, if Zij <-t 
0, if - t < Z^ < t. 

The proximal mapping of h(L) defined in ()2.6[) is 

(2.14) prox^Z) :=axgmm L ±\\L - Z\\% + /3Tr(L) +1(L h 0). 
It is easy to verify that the solution of (|2.14j) is given by 

(2.15) L := [/diag( 7 )[/ T , 

where Z = U diag(<r)[/ T is the eigenvalue decomposition of Z and 7 is given by 

(2.16) 7i := maxjcr, - £/3,0}, i = l,...,p. 

Note that (|2.16p guarantees that L given in (|2.15[) is a positive semidcfinite matrix. 



The discussions above suggest the following natural ADMM for solving (|2.2[) be efficient. 




argmin^, C^R, S k , L k - A k ) 
argmin s £ M (i? fc+1 , S, L k ;A k ) 
argmin^ £ M (i? fc+1 , S k+1 , L; A k ) 
A k - (R k+1 -S k+1 +L k+1 )/fi, 



(2.17) 



where the augmented Lagrangian function is defined as 

(2.18) C„(R,S,L;A) := (iJ,Ejc)-Iogdeti2+a||5||i+j8Ti-(i)+I(i h 0)-(A,R-S+L) + ^-\\R-S+L\\ 2 F , 

A is the Lagrange multiplier and /i > is the penalty parameter. Note that the three subproblcms in p. 171) 
correspond to the proximal mappings of /, g and h defined in (|2.4[) . (|2.5[) and (|2.6[) . respectively. Thus they 
are all easy to solve. However, the global convergence of ADMM (|2.17|) with three blocks of variables was 
ambiguous. Only until very recently, was it shown that (|2.17[) globally converges under certain conditions 
(see |36j ) . It should be noted, however, that the error bound condition required in |36j is strong and only a 
few classes of convex function are known that satisfy this condition. 

3. ADMM for Solving (|2.2[) as a Consensus Problem. Problem (|2.2[) can be rewritten as a convex 
minimization problem with two blocks of variables and two separable functions as follows: 



where X = (R,S,L),Z = (R, S, L), and 

<t>{X) := f(R) + g(S) + h(L), $(Z) =1(R-S + L = 0), 

with f,g and h defined in (j2~4"|) . (|23|) and (|2~B|) . respectively. The ADMM applied to solving (j3~Tj) can be 

described as follows: 



where A is the Lagrange multiplier associated with the equality constraint. The two subproblcms in (|3.2[) 
are both easy to solve. In fact, the solution of the first sub-problem in p.2j) corresponds to the proximal 
mappings of /, g and h. Partitioning the matrix T k := X k+1 — /j,A k into three blocks in the same form as 
Z = (R, S, L), The second subproblem can be reduced to: 



(3.1) 



mm 



s.t. 



X - Z = 0, 



(3.2) 




(3.3) 



nun 



s.t. 



i||(i?,5,i)-(T£,T!,T L fc )||!, 
R- S + L = 0. 



The first-order optimality conditions of (|3.3p are given by 



(3.4) 



(R, s, L) - {T k , T k , T k ) - (r, -r, r) = o, 
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where L is the Lagrange multiplier associated with (|3.3p . Thus we get, 



r = 2$ : + r, 5 = t| - r, £ = t£ + r. 

Substituting them into the equality constraint in Q3.3p . we get 
(3.5) T = -(T k -T k + T k )/3. 

By substituting Q3.5P into (|3.4p we get the solution to p~3J). 

The ADMM (|3.2p solves Problem (|3.1[) with two blocks of variables. It can be seen as a special case of 
the consensus problem discussed in [3J. The global convergence result of (|3.2p has also been well studied in 
the literature (see e.g., [TrJl ITS] ). 



4. A Proximal Gradient based Alternating Direction Method. In this section, we propose 
another alternating direction type method to solve (|2.2p . In Section [31 we managed to reduce the original 
problem with three blocks of variables (|2.2p to a new problem with two blocks of variables (|3.ip . As a 
result, we can use ADMM for solving problems with two blocks of variables, whose convergence has been 
well studied. Another way to reduce the problem (|2.2p into a problem with two blocks of variables is to 
group two variables (say S and L) as one variable. This leads to the new equivalent form of (|2.2p : 

min f(R) + <p(W) 
s.t. R - [I, -I]W = 0, 

where W = [S; L] and tp(W) = g(S) + h(L). Now the ADMM for solving ([3~H can be described as 

{ R k+i ._ argmin H /(i?) - (A fe ,i?- [7,-/]V^ fc ) + [/, -/]iy fc |j|, 

W k+i ._ argminM/ ^(^) - (A fc , i? fe+1 - [7, -I]W) + ^ ||i? fe+1 - [I, -I]Wf F 
A fc+i . = A fc _ (i? fe+i _ [ 7j 

where A is the Lagrange multiplier associated with the equality constraint and \x > is a penalty parameter. 
The first subproblem in (|4.2p is still easy and it corresponds to the proximal mapping of function /. However, 
the second subproblem in (|4.2p is not easy, because the two parts of W are coupled together in the quadratic 
penalty term. To overcome this difficulty, we solve the second subproblem in (|4.2j) inexactly by one step of 
a proximal gradient method. Note that the second subproblem in (|4.2p can be reduced to 



(4.3) W k+1 := argmin w <p(W) + - -I\W - fiA k \\ 2 F . 
One step of proximal gradient method solves the following problem 

(4.4) min <p{W) + ^\\W - (W k + r (*) (R k +* - [I, -I]W k - ^ k ))\\ 2 F . 

Since the two parts of W = [S;L] are separable in the quadratic part now, (|4.4I) reduces to two problems 

(4.5) min g(S) + ^-\\S - (S k + rG k R )\\ 2 F , 
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and 



(4.6) mm h(L) + ^~\\L- (L k - rG k R )\\ 2 F , 

where G k R = R k+1 — S k +L k — /j,A k . Both (|4.5[) and (|4.6[) are easy to solve as they correspond to the proximal 
mappings of functions g and h, respectively. Thus, our proximal gradient based alternating direction method 
(PGADM) can be summarized as 



Algorithm 1 A Proximal Gradient based Alternating Direction Method 

1: for k=0,l,. . . do 

2: R k+1 :=argmin fl f(R) - <A fc , R - S k + L k ) + ± \\R - S k + L k \\% 
3: S k+1 :=argmin s g(S) + £.\\S - (S k + rG k R )f F 
4: L k+1 := argmin^ h(L) + £.\\L - (L k - rG k R )\\ 2 F 
5: A fc+1 := A fc - (R k+1 - S k+1 + L k+1 )/fi 

6: end for 



Remark 4.1. The idea of incorporating proximal step into the alternating direction method of multipliers 
has been suggested by JiTf and |2J/. This idea has then been generalized by \28f to allow varying penalty and 
proximal parameters. Recently, this technique has been used for sparse and low-rank optimization problems 
(see \57\[ and \50jj ). More recently, some convergence properties of alternating direction methods with proximal 
gradient steps have been studied by \36Tj , [20] and I38j. However, for the seek of completeness, we include 
a global convergence proof for Algorithm]]^ in the Appendix. 

Remark 4.2. In Algorithm]]^ we grouped S and L as one block of variable. We also implemented the 
other two ways of grouping the variables, i.e., group R and S as one block, and group R and L as one block. 
We found from the numerical experiments that these two alternatives yielded similar practical performance 
as Algorithm^ 

Remark 4.3. If we use the 1-off absolute penalty \\S\\i, r '■= ^i^j t° replace ||<S|[i, our algorithm 
basically remains the same except that we modify Shrink(-, •) as follows 



(4.7) 



[Shrink(Z, t)] 



Zu , if i — j 

Z{j — t, if i ^ j and Zij > r 

Zij + t, if i ^ j and Zij < — r 

0, if i 7^ j and — r < Z^ < t. 



5. Numerical experiments. In this section, we present numerical results on both synthetic and real 
data to demonstrate the efficiency of the proposed methods: ADMM (|3.2p and PGADM (Algorithm [1]). Our 
codes were written in MATLAB. All numerical experiments were run in MATLAB 7.12.0 on a laptop with 
Intel Core 15 2.5 GHz CPU and 4GB of RAM. 

We first compared ADMM (|3.2[) with PGADM (Algorithm [lj on some synthetic problems. We compared 
ADMM and PGADM using two different ways of choosing [i. One set of comparisons used a fixed [i = 10, 
and the other set of comparisons used a continuation scheme to dynamically change pi. The continuation 
scheme we used was to set the initial value of [i as the size of the matrix p, and then multiply /i by 1/4 after 
every 10 iterations. 

We then compared the performance of PGADM (with continuation on fj,) with LogdetPPA proposed by 
[52] and used in [7] for solving (f2~2]) . 

8 



5.1. Comparison of ADMM and PGADM on Synthetic Data. We observed form the numerical 
experiments that the step size r of the proximal gradient step in PGADM (Algorithm [T]) can be slightly 
larger than 1/2 and the algorithm produced very good results. We thus chose the step size t to be 0.6 in 
our experiments. 

We randomly created test problems using a procedure proposed by [H] and [25] for the classical graphical 
lasso problems. Similar procedures were used by |52j and |32j . For a given number of observed variables 
p and a given number of latent variables p^, we first created a sparse matrix U £ ^(p+Ph)x(p+Ph) w ith 
sparsity around 10%, i.e., 10% of the entries are nonzeros. The nonzero entries were set to -1 or 1 with 
equal probability. Then we computed K :~ (U * U T )~ 1 as the true covariance matrix. We then chose 
the submatrix of A, S := K(l : p, 1 : p) as the ground truth matrix of the sparse matrix S and chose 
L := K(l : p,p+ 1 : p + ph)K(p + 1 : p+ph,p + 1 : P+Ph)^ 1 K(p+ 1 : p+ph, 1 : p) as the ground truth matrix 
of the low rank matrix L. We then drew N = bp iid vectors, Yj., . . . , Y/v, from the Gaussian distribution 
A/"(0, (S — L)^ 1 ) by using the mvnrnd function in MATLAB, and computed a sample covariance matrix of 
the observed variables Ex := -k J2iLi ^i^i "■ 

We computed the relative infeasibility of the sequence (R k , S , L k ) generated by inexact ADMM using 

fK ^ . ( \\R k -S k + L k \\ F 

(5.1) micas 



m ax {l,\\R"\\ F ,\\S"\\ F ,\\L''\\ F y 

In the comparison of ADMM and PGADM, the size of all problems was chosen as p = 1000. For fixed 
/t = 10, we first ran the ADMM for 100 iterations, and recorded the objective function value and infeas. We 
then ran PGADM until it achieves an objective function value within relative error 10~ J compared with the 
objective function value given by ADMM, or it achieves an infeas within relative error 10 -5 compared with 
the infeas given by ADMM. The number of iterations, CPU times, infeas and objective function values 
for both ADMM and PGADM were reported in Table EU From Table [5Tj we see that for fixed [i = 10, 
ADMM was faster than PGADM when a and /3 are both small, and PGADM was faster than ADMM when 
a and /3 are both large. 

Table 5.1 

Comparison of ADMM with PGADM (for fixed fi) on synthetic data 





PGADM 


ADMM 


a /3 


obj 


iter 


cpu 


infeas 


obj 


iter 


cpu 


infeas 


0.005 0.025 


-1.6987e+002 


135 


224.1 


7.6e-005 


-1.7098e+002 


100 


172.9 


6.6e-005 


0.005 0.05 


-9.3385c+001 


146 


243.7 


3.8c-005 


-9.3476c+001 


100 


173.7 


2.9e-005 


0.01 0.05 


-4.4748c+001 


132 


214.7 


1.7c-005 


-4.4761e+001 


100 


180.1 


7.0e-006 


0.01 0.1 


5.4571e+001 


111 


177.4 


9.8c-006 


5.4567e+001 


100 


166.9 


2.8e-007 


0.02 0.1 


1.1881e+002 


83 


136.8 


9.7c-006 


1.1881e+002 


100 


170.1 


1.9e-007 


0.02 0.2 


2.3717e+002 


56 


91.5 


9.3c-006 


2.3717c+002 


100 


174.0 


1.6e-007 


0.04 0.2 


3.2417c+002 


44 


67.0 


9.2e-006 


3.2417c+002 


100 


165.2 


1.6e-007 


0.04 0.4 


4.5701e+002 


19 


29.4 


3.1c-004 


4.5700e+002 


100 


168.4 


1.7e-007 



We then further compare ADMM and PGADM on synthetic data with the continuation scheme for fj, 
discussed above. We terminated both ADMM and PGADM when infeas < 10~ 5 . We reported the results 
in Table [El 

From Table I5T21 we see that the continuation scheme used really helped to speed up the convergence and 
produced much better results. Also, using this continuation scheme, PGADM was faster than ADMM with 
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comparable residuals and objective function values. However, we should remark that PGADM was faster 
than ADMM using the specific continuation scheme. If other continuation schemes were adopted, the results 
could be quite different. In the comparison with LogdetPPA in the following sections, we only compare 
LogdctPPA with PGADM with this continuation scheme. 

Table 5.2 

Comparison of ADMM with PGADM (with continuation for /i) on synthetic data 





PGADM 


ADMM 


a /3 


obj 


iter 


cpu 


resid 


obj 


iter 


cpu 


resid 


0.005 0.025 


-1.711329c+002 


32 


51.5 


5.7e-006 


-1.711334c+002 


61 


103.7 


6.7e-006 


0.005 0.05 


-9.348245c+001 


41 


65.8 


4.2e-006 


-9.348589c+001 


62 


103.2 


6.6e-006 


0.010 0.05 


-4.476323c+001 


41 


67.8 


2.8e-006 


-4.476499c+001 


62 


105.7 


9.8e-006 


0.010 0.10 


5.456790c+001 


41 


65.7 


5.1e-006 


5.456724c+001 


71 


119.6 


6.0e-006 


0.020 0.10 


1.188077c+002 


41 


64.9 


3.4e-006 


1.188054c+002 


71 


115.1 


8.1c-006 


0.020 0.20 


2.371659c+002 


45 


76.4 


8.7e-006 


2.371687c+002 


75 


125.6 


7.0e-006 


0.040 0.20 


3.241688c+002 


44 


72.7 


8.3e-006 


3.241684e+002 


75 


126.6 


8.5e-006 


0.040 0.40 


4.570019c+002 


50 


77.5 


7.5e-006 


4.570058e+002 


78 


126.5 


9.7e-006 



5.2. Comparison of PGADM and LogdetPPA on Synthetic Data. In this section, we compare 
PGADM with LogdetPPA on synthetic data created the same way as in the last section. LogdetPPA, 
proposed by Wang et al. in |52j , is a proximal point algorithm for solving semidefmite programming problems 
with logdet(-) function. The specialized MATLAB codes of LogdetPPA for solving (|2.2p were downloaded 



from http://ssg.mit.edu/~venkatc/latent-variable-code.html 



We compared PGADM (with continuation on fj,) with LogdetPPA with different a and /3. We reported 
the comparison results on objective function value, CPU time, sparsity of S and infeas in Table 15.31 The 
sparsity of S is denoted as 

._ #{(»,j):^/0} 
sp . „ , 

i.e., the percentage of nonzero entries. Since matrix S generated by LogdetPPA is always dense but with 
many small entries, we also measure its sparsity by truncating small entries that less than 10~ 4 to zeros, i.e., 

. = #{(i,j):|5 i3 -|>10- 4 } 
p 2 

All CPU times reported are in seconds. We report the speed up of PGADM over LogdetPPA in Table I5T41 
From Table 15.31 we see that the solutions produced by PGADM always have comparable objective 
function values compared to the solutions produced by LogdetPPA. However, our PGADM is always much 
faster than LogdetPPA, as shown in both Tables [531 and I5T41 In fact, PGADM is usually ten times faster 
than LogdetPPA, and sometimes more than thirty five times faster. For example, for the four large problems 
with matrices size 2000 x 2000, LogdetPPA needs 1 hour 23 minutes, 1 hour 26 minutes, 1 hour 47 minutes 
and 1 hour 20 minutes, respectively, to solve them, while our PGADM needs about 7 minutes, 7 minutes, 8 
minutes and 9 minutes respectively to solve them. We also notice that the matrix S generated by PGADM 
is always a sparse matrix with many entries exactly equal to zero, but S generated by LogdetPPA is always 
a dense matrix, and only when we truncate the entries that are smaller than 10~ 4 to zeros, it becomes a 
sparse matrix with similar level of sparsity. This is because in our PGADM, S is updated by the l\ shrinkage 
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Table 5.3 

Results of PGADM and LogdetPPA on synthetic data 



dim 


LogdetPPA 


PGADM 


P 


obj 


cpu 


s P (%) 


spl (%) 


obj 


cpu 


SP (%) 


infeas 


a = 0.005, P = 0.025 


200 


1.914315c+2 


7.2 


100.00 


19.17 


1.910379c+2 


1.0 


18.90 


4.3e-6 


500 


1.898418c+2 


235.2 


100.00 


5.78 


1.898275c+2 


7.3 


5.63 


7.2e-6 


1000 


-1.711293c+2 


1706.0 


100.00 


0.52 


-1.711329c+2 


48.2 


0.49 


5.7e-6 


2000 


-1.430010c+3 


5001.2 


100.00 


0.06 


-1.435605c+3 


427.2 


0.05 


4.9e-6 


a = 0.005, /3 = 0.05 


200 


1.926376c+2 


29.1 


100.00 


43.63 


1.924829c+2 


2.5 


48.66 


6.6e-6 


500 


2.051884c+2 


358.3 


100.00 


12.04 


2.051425c+2 


10.5 


11.42 


8.8e-6 


1000 


-9.347297c+l 


1076.4 


100.00 


4.88 


-9.348245c+l 


71.6 


4.72 


4.2e-6 


2000 


-1.2293230+3 


5191.5 


100.00 


0.21 


-1.230238C+3 


445.0 


0.15 


9.5e-6 


a = 0.01, P = 0.05 


200 


2.030390c+2 


20.8 


100.00 


20.46 


2.026586c+2 


2.6 


11.06 


9.0e-6 


500 


2.394720c+2 


146.0 


100.00 


4.25 


2.394631c+2 


10.7 


4.14 


3.7e-6 


1000 


-4.476078c+l 


740.6 


100.00 


0.24 


-4.476323c+l 


72.8 


0.23 


2.8e-6 


2000 


-1.101454C+3 


6433.4 


100.00 


0.05 


-1.111504C+3 


453.7 


0.05 


6.4e-6 


a = 0.01, £ = 0.1 


200 


2.050879e+2 


29.1 


100.00 


42.16 


2.048359c+2 


1.6 


35.83 


7.0e-6 


500 


2.639548c+2 


235.2 


100.00 


8.62 


2.638565c+2 


11.6 


8.19 


5.8e-6 


1000 


5.456825c+l 


932.1 


100.00 


2.26 


5.456790c+l 


74.9 


2.26 


5.1e-6 


2000 


-8.541802c+2 


4813.6 


100.00 


0.14 


-8.712916c+2 


516.6 


0.07 


8.6e-6 



operation, which truncates the small entries to zeros, while LogdetPPA needs to replace ||5||i with smooth 
linear function which does not preserve sparsity. 

Table 5.4 

Speed up of PGADM over LogdetPPA on synthetic data. 



p 


a = 0.005, P = 0.025 


a = 0.005, P = 0.05 


a = 0.01, P = 0.05 


a = 0.01, P = 0.1 


200 


7.1 


11.7 


7.9 


18.7 


500 


32.0 


34.2 


13.6 


20.3 


1000 


35.4 


15.0 


10.2 


12.4 


2000 


11.7 


11.7 


14.2 


9.3 



5.3. Comparison of PGADM and LogdetPPA on Gene expression data. To further demon- 
strate the efficacy of PGADM, we applied PGADM to solving (|2.2j) with two gene expression data sets. One 
data set is the Rosetta Inpharmatics Compendium of gene expression data (denoted as Rosetta) [29] profiles 
which contains 301 samples with 6316 variables (genes). The other data set is the Iconix microarray data 
set (denoted as Iconix) from drug treated rat livers [H] which contains 255 samples with 10455 variables. 

For a given number of observed variables p, we created the sample covariance matrix Ex by the following 
procedure. We first computed the variances of all of variables using all the sample data. We then selected the 
p variables with the highest variances and computed the sample covariance matrix Sjf of these p variables 
using all the sample data. We reported the comparison results of PGADM and LogdetPPA in Tables 15.51 
and 15.61 for the Rosetta and Iconix data sets, respectively. Table 15.71 summarizes the speed up of PGADM 
over LogdetPPA. 
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Table 5.5 

Results of PGADM and LogdetPPA on Rosetta data set 



dim 


LogdetPPA 


PGADM 


P 


obj 


cpu 


SP (%) 


spl (%) 


obj 


cpu 


s P (%) 


infeas 


200 


-2.726188c+2 


5.6 


100.00 


0.58 


-2.726184c+2 


0.7 


0.58 


7.1e-6 


500 


-8.662116c+2 


68.0 


100.00 


0.20 


-8.6621130+2 


7.9 


0.20 


3.7e-6 


1000 


-1.970974C+3 


490.9 


100.00 


0.10 


-1.970973c+3 


52.2 


0.10 


7.7e-6 


2000 


-4.288406c+3 


4597.9 


100.00 


0.05 


-4.288406c+3 


422.3 


0.05 


5.4e-6 



Table 5.6 

Results of PGADM and LogdetPPA on Iconix data set 



dim 


LogdetPPA 


PGADM 


P 


obj 


cpu 


SP (%) 


spl (%) 


obj 


cpu 


sp (%) 


infeas 


200 


1.232884c+3 


38.5 


100.00 


36.10 


1.232744e+3 


2.5 


41.62 


7.2c-6 


500 


1.842623c+3 


98.8 


100.00 


2.27 


1.839838e+3 


17.0 


0.77 


l.Oc-5 


1000 


1.439052c+3 


1341.9 


100.00 


1.45 


1.435425c+3 


94.6 


0.14 


6.3c-6 


2000 


1.242966c+2 


13207.2 


100.00 


0.07 


1.168757c+2 


738.2 


0.06 


8.5c-6 



From Table [531 we again see that PGADM always generates solutions with comparable objective function 
values in much less time. For example, for p = 2000, LogdetPPA needs 1 hour 16 minutes to solve it while 
PGADM takes just 7 minutes. From Table 15. 6[ we see that the advantage of PGADM is more obvious. 
For p = 200, 500, 1000 and 2000, PGADM always generates solutions with much smaller objective function 
values and it is always much faster than LogdetPPA. For example, for p = 2000, LogdetPPA takes 3 hours 
40 minutes to solve it while PGADM just takes about 12 minutes. 

Table 5.7 

Speed up of PGADM over LogdetPPA on Rosetta and Iconix data sets. 



dim 


Rosetta data 


Iconix data 


200 


8.0 


15.4 


500 


8.6 


5.8 


1000 


9.4 


14.2 


2000 


10.9 


17.9 



6. Conclusion. In this paper, we proposed alternating direction methods for solving latent variable 
Gaussian graphical model selection. The global convergence results of our methods were established. We 
applied the proposed methods for solving large problems from both synthetic data and gene expression data. 
The numerical results indicated that our methods were five to thirty five times faster than a state-of-the-art 
Newton-CG proximal point algorithm. 
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Appendix A. Global Convergence Analysis of PGADM. 

In this section, we establish the global convergence result of PGADM (Algorithm Q}. This convergence 
proof is not much different with the one given by |57j for compressed sensing problems. We include the proof 
here just for completeness. 

(s\ 

We introduce some notation first. We define W = II- We define functions F(-) and G(-) as 

F(R) := (fi, t x ) - log det.fi, 

and 

G(W) := a\\S\\i+PTr(L)+l(L t 0). 

Note that both F and G are convex functions. We also define matrix A as A = [— I p xp, Ipxp] S M. px2p . Now 
Problem (12.21) can be rewritten as 



(A.l) min F(R) + G(W), s.t. fi + AW = 0, 

and our PGADM (Algorithm [lj can be rewritten as 



(A.2) { W k+1 
A fe+i 



rk 112 



R k+i ._ &Tgm i nR F(R) + G{W k )- (A k ,R + AW k ) + ^ 1 \\R + AW k \ lF 



= argmin^ F{R k+1 ) + G(W) + ^\\W - (W k - tA j ' (R k+1 + AW k - fiA k )) 
= A k - (R k+1 +AW k+1 )/ii. 



Before we prove the global convergence result, we need to prove the following lemma. 

Lemma A.l. Assume that (R* , W*) is an optimal solution of (|A.lj) and A* is the corresponding optimal 
dual variable associated with the equality constraint R + AW — 0. Assume the step size r of the proximal 
gradient step satisfies < r < 1/2. Then there exists r) > such that the sequence (R k ,W k , A k ) produced 
by (|A.2[) satisfies 



(A.3) \\U k - U*\\ 2 H - \\U k+1 - U*\\ 2 H > n\\U k - U k ^ 



where U* = ( 4 ^ ] , U k — ( fc ] and H = ( MT ^ Xp t J , and the norm || • \\ 2 H is defined as \\U\\ 2 H 



W *),U k = ( Wk \andH= ^ ° 
A* J \A k J \ iil pxp/ 

(U,HU) and the corresponding inner product {-,-)h is defined as {U,V)h = (U,HV). 

Proof. Since (fi*,W*,A*) is optimal to (|A.1[) . it follows from the KKT conditions that the followings 
hold: 

(A.4) 6 <9F(fi*) - A*, 



(A.5) 0edG(W*)- A T A*, 
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and 

(A.6) = R*+AW*. 

Note that the first-order optimality conditions for the first sub-problem (i.e., the subproblcm with respect 
to R) in (IA.2I) are given by 

(A.7) G dF(R k+1 ) -A k + -(R k+1 + AW k - 0). 

M 

By using the updating formula for A fc , i.e., 

(A.8) A k+1 = A k - {R k+1 + AW k+1 )/n, 

()A.7| can be reduced to 

(A.9) G dF(R k+1 ) - A k+1 - -(AW k+1 - AW k ). 

A* 

Combining (|A.4[) and (|A.9|) and using the fact that dF(-) is a monotone operator, we get 

(A.10) (R k+1 - R*,A k+1 - A* + -(AW k+1 - AW k )) > 0. 

The first-order optimality conditions for the second subproblcm (i.e., the subproblem with respect to 
W) in (|A.2[) are given by 

(A.ll) G dG(W k+1 ) + — (W k+1 - (W k - tA t (AW k + R k+1 - fxA k ))). 

Using (|A.8|) . (jA.lip can be reduced to 

(A.12) G dG{W k+1 ) + — {W k+1 -W k + rA T (AW k - AW k+1 - ^A k+1 )). 

(J.T 

Combining (|A.5P and ()A.12p and using the fact that dG(-) is a monotone operator, we get 
(A.13) (W k+1 - W\ —(W k - W k+1 ) - -A T {AW k - AW k+1 ) + A T A k+1 - A T A*) > 0. 

Summing ()A~T0l) and (|A~T3]l . and using R* = -AW* and R k+1 = fi(A k - A k+1 ) - AW k+1 , we obtain, 
(A.14) — (W k+1 - W*,W k -W k+l ) +/i(A fc+1 - A*,A fe - A fe+1 ) > (A k - A k+1 ,AW k - AW k+l ). 

[IT 

Using the notation of U k , U* and H, (|A.14[) can be rewritten as 

(A.15) (U k+1 -U*,U k - U k+1 ) H > (A k - A k+ \ AW k - AW k+1 ), 

which can be further written as 

(A.16) (U k -U*,U k -U k+l ) H > \\U k - U k+1 \\ H + (A k - A k+1 ,AW k - AW k+1 ). 
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Combining (|A.16[) with the identity 



\\U k+1 - U*\\ 2 H = \\U k+1 - U k \\ 2 H - 2(U k - U k+ \ U k - U*) H + \\U k - U*\\ 2 H , 

we get 

\\U k -U*f H -\\U k+1 -U*f H 
(A.17) = 2(U k - U k+ \U k - U*) H - \\U k+l - U k \\ 2 H 

> \\U k+1 - U k \\ 2 H + 2(A k - A k+1 ,AW k - AW k+1 ). 

Let £ := r + 1/2, then we know that 2r < £ < 1 since < r < 1/2. Let p := Then from 
Cauchy-Schwartz inequality we have 

2(A k - A k+1 ,AW k -AW k+l ) > - P \\A k - A k+1 \\ 2 - )\\AW h ~ AW k+1 \\ 2 
(A.18) > -p\\A k - A k+1 \\ 2 - ±\ m!iK (A T A)\\W k -W k+1 \\ 2 

= -p\\A k -A k+1 \\ 2 -j\\W k ~W k+1 \\ 2 , 

where the X max (A T A) denotes the largest eigenvalue of matrix A T A and the equality is due to the fact that 
A max (A T A) = 2. Combining (|A.17|) and (|A.18|) we get 

\\U k -U*f H -\\U k ^-U*\\% > (J--l)\\w k -W k +i\\ 2 + (p-p)\\A k -A k +i\\ 2 

> ^i^.^fc+iii^ 

where n := min{l — 1 — ^} = min{l — 1 — £} > 0. This completes the proof. □ 
We are now ready to give the main convergence result of Algorithm (|A.2[) . 

Theorem A. 2. The sequence {(R k ,W k , A k )} produced by Algorithm (|A.2|) from any starting point 
converges to an optimal solution to Problem (jA.lj) . 
Proof. From Lemma IA.1I we can easily get that 
. (i) \\U k -U k+1 \\ H -+0; 

• (ii) {U k } lies in a compact region; 

• (hi) \\U k — U*\\jj is monotonically non-increasing and thus converges. 

It follows from (i) that A fe - A fe+1 — » and W k - W k+1 -4 0. Then ()A~8|) implies that R k - R k+l -> and 
R k + AW k — > 0. From (ii) we obtain that, U k has a subsequence {U kj } that converges to U = (W,A), i.e., 
A k > -> A and W fc * -» W". From + AW* -> we also get that R k i R := -AW. Therefore, (R, W, A) 
is a limit point of {(R k 1 W k ,A k )}. 
Note that (|A.9[) implies that 

(A.20) e dF(R) - A. 

Note also that (|A.12j) implies that 

(A.21) e dG(W) - A T A. 

(|AT20]l , (|A~2T|) and R + AW = imply that (R, W,A) satisfies the KKT conditions for (|A~T|) and thus is 
an optimal solution to (|A.1[) . Therefore, we showed that any limit point of {(R k ,W k , A k )} is an optimal 
solution to (|A.1[) . 
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To complete the proof, we need to show that the limit point is unique. Let {(Ri, W\, Ai)} and 
{(-R2, W2, A 2 )} be any two limit points of {(R k ,W k ,A k )}. As wc have shown, both {(Ri,Wi,K.i)} and 
{(R 2 , W 2 , A 2 )} are optimal solutions to (|A.lj) . Thus, U* in (|A.19[) can be replaced by Ui := (Ri,Wi,Ai) 
and U 2 (R2,W 2 ,A 2 ). This results in 

||C/ fc+1 -^|^<||f/ fe -^||ff, * = 1,2, 
and we thus get the existence of the limits 

lim ||l/ fe -C7' i || H = » h <+oo, i = 1,2. 

Now using the identity 

\\u k - u.wl - \\u k - u 2 \\% = -2([/ fc ,^ - u 2 ) H + wu.wl - \\u 2 \\% 

and passing the limit we get 

ri - v 2 2 = -2{tli,Ui - U2)h + \\Ui\\ 2 h - \\U 2 \\ 2 H = -||t\ - U 2 f H 

and 

vi - = -202, Ui - U 2 )h + \\Ui\\% - \\u 2 \\% = \\u 1 - u 2 \\%. 

Thus wc must have \\Ui — C^2||# = and hence the limit point of {(R k ,W k , A k )} is unique. □ 

We now immediately have the global convergence result for Algorithm Q] for solving Problem (|2.2D . 
Corollary A. 3. The sequence {(R k , S k , L k , A k )} produced by Algorithm [I] from any starting point 

converges to an optimal solution to Problem 
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